Skip to content

feat(scan): pure-TS ripgrep-compatible scanner + DSN migration + perf overhaul#791

Merged
BYK merged 7 commits intomainfrom
byk/feat/ripgrep
Apr 21, 2026
Merged

feat(scan): pure-TS ripgrep-compatible scanner + DSN migration + perf overhaul#791
BYK merged 7 commits intomainfrom
byk/feat/ripgrep

Conversation

@BYK
Copy link
Copy Markdown
Member

@BYK BYK commented Apr 20, 2026

Summary

Introduces src/lib/scan/ — a pure-TypeScript, dependency-free-at-runtime
ripgrep-compatible file scanner — and migrates the DSN scanner at
src/lib/dsn/code-scanner.ts to consume it. Also ships a local benchmark
harness with a data-driven concurrency tuning pass, a whole-buffer
matchAll rewrite of the grep engine, and a two-tier IgnoreStack
optimization.

The DSN scanner shrinks from 715 → 435 LOC (−280) while keeping all
public export signatures unchanged. DSN detection's common-case hot paths
(detectDsn.cold, scanCodeForFirstDsn) become 46-87× faster. The
exhaustive find-all path holds near parity with pre-change numbers
(+4.7%, well under the 20% budget) while gaining nested-.gitignore
correctness and dirMtimes cache coverage.

What's in the module

src/lib/scan/ (10 files, ~2.7k LOC, exported via a barrel):

  • walker.ts — DFS async generator, streaming readdir({withFileTypes}), time-budgeted exploration, monorepo descent hook, onDirectoryVisit observer for cache-invalidation callers
  • ignore.tsIgnoreStack wrapping the ignore npm package with a root-only fast path + per-dir nested .gitignore support (matches git's cumulative semantics)
  • binary.ts — 8 KB NUL-byte sniff + extension fast path for text/binary classification
  • concurrent.ts — shared mapFilesConcurrent / mapFilesConcurrentStream with cooperative early-exit flag (never uses pLimit.clearQueue() — that hangs Promise.all)
  • regex.tscompilePattern (rg-style leading inline flags → JS RegExp flags) + ensureGlobalFlag / ensureGlobalMultilineFlags helpers
  • grep.tsgrepFiles (AsyncIterable<GrepMatch>) + collectGrep (sorted Promise<GrepResult>). Whole-buffer matchAll iteration.
  • glob.tsglobFiles + collectGlob, picomatch-backed
  • constants.tsTEXT_EXTENSIONS, DEFAULT_SKIP_DIRS, DSN_ADDITIONAL_SKIP_DIRS, MAX_FILE_SIZE, CONCURRENCY_LIMIT, helpers (normalizePath, isMonorepoPackageDir)
  • types.tsWalkEntry, WalkOptions, GrepMatch, GrepOptions, GlobOptions, IgnoreMatcher
  • index.ts — public barrel

DSN-specific policy lives in src/lib/dsn/scan-options.ts. scan/ stays policy-free so the init wizard (follow-up PR) can bring its own preset.

DSN scanner migration (src/lib/dsn/code-scanner.ts)

Before After
LOC 715 435 (−280)
Public exports 5 5 (signatures unchanged)
Telemetry span + attrs scanCodeForDsns / dsn.detect.code unchanged
Nested .gitignore root-only cumulative (git semantics)
Early-exit logic ad-hoc state.earlyExit shared mapFilesConcurrent.onResult

DSN-specific content extraction (extractDsnsFromContent, isCommentedLine, isValidDsnHost, getExpectedHost) stays in code-scanner.ts unchanged. The walker + concurrency + gitignore handling moved to scan/.

Added a case-insensitive literal-prefix fast path to extractDsnsFromContent (/http/i.test(content)) that short-circuits matchAll when the file contains no scheme prefix in any casing.

Performance (synthetic/large, 10k files, 4-core)

Op Pre-PR baseline Now Δ
detectDsn.cold 179 ms 3.76 ms 48× faster
scanCodeForFirstDsn 183 ms 2.10 ms 87× faster
scanCodeForDsns 298 ms 312 ms +4.7% (under 20% bar)
detectAllDsns.cold 304 ms 318 ms +4.6%
scan.grepFiles (new op) 283 ms 0.95× of OLD scanCodeForDsns
scan.ignore (µbench) (new op) 2.09 ms 81% faster than pre-opt

The 46-87× wins on the stop-on-first hot paths come from mapFilesConcurrent's cooperative early-exit flag — the old scanner's stop-on-first was less aggressive. That's the path every sentry issue list invocation hits.

The small regression on scanCodeForDsns is the cost of nested-gitignore correctness (~41ms single-threaded) plus the case-insensitive scheme probe (~15ms vs raw indexOf). Both are correctness improvements; the regression is within the planning budget (20%).

Benchmark harness

script/bench.ts + script/bench-sweep.ts, deterministic xorshift32-seeded fixture generator at test/fixtures/bench/. Three presets (small ~100 files, medium ~1k, large ~10k). Baselines gitignored (machine-specific).

bun run bench                 # all ops, all sizes
bun run bench --size large --runs 10
bun run bench --save-baseline # .bench/baseline.json (local-only)
bun run bench --compare       # fail on >20% regression
bun run bench:sweep           # concurrency sweep to tune CONCURRENCY_LIMIT

Op labels intentionally match Sentry span names (scanCodeForDsns, findProjectRoot, …) so local measurements correlate with production telemetry.

Optimizations landed

  1. mapFilesConcurrent + per-file extractDsnsFromContent(content, limit: 1) — the early-exit wins (46-87× on stop-on-first).
  2. CONCURRENCY_LIMIT tied to availableParallelism() with floor 2 and cap 16. The JSDoc documents the tradeoff between walker-fed (our real workload) and pure-I/O workloads, and notes that callers can override per-call.
  3. grep.ts::readAndGrep whole-buffer matchAll — replaced content.split("\n") + per-line regex.test with regex.exec + lazy line-num counting. 12× faster on the profiled grep CPU phase.
  4. Case-insensitive literal-prefix fast path in DSN/http/i.test(content) short-circuits matchAll on the ~92% of files that can't possibly contain a DSN. Must be case-insensitive for correctness.
  5. IgnoreStack two-tier query — root-only fast path when no nested gitignores loaded (14× per-query speedup); prefix-walk slow path for the nested case (3× speedup). The scan.ignore microbench dropped from 10.9 ms → 2.1 ms.

Non-goals, explicitly rejected with evidence

  • Worker threads: profiling showed CPU work (45 ms DSN regex on the large fixture) is smaller than worker bootstrap (~40 ms for 4 workers) + postMessage serialization (~50-200 ms to move 80 MB of content). Net negative. A 12× single-threaded win in grep.ts covered the gap instead.
  • Replace ignore with picomatch: gitignore syntax (bare basename match, trailing / = subtree, leading / anchored) isn't compatible with picomatch glob syntax. Would require a 200-400 LOC translator with subtle correctness edge cases. ignore package is ~800 LOC specifically because this is non-trivial.

Behavior change (intentional)

Nested .gitignore now honored in DSN detection. Pre-change: only the project-root .gitignore was read. Post-change: every directory's .gitignore applies to its subtree with cumulative semantics (parent patterns apply, child negations override).

User-visible impact: a DSN in a file covered by a subdirectory .gitignore is no longer detected. This matches git's own check-ignore behavior and is almost always what users want (a gitignored file should not leak DSNs to the CLI). If anyone depends on the old behavior, they can remove the relevant .gitignore rule.

Review round fixes

  • Fixed mixed-case DSN regression (Https://, hTtP://, etc. were silently dropped by the initial indexOf-based probe). The probe is now /http/i.test(content). Widened the property test arbitrary to include uppercase so this class is actually exercised. Added 5 parametrized regression tests in code-scanner.test.ts.
  • Reverted an overly-aggressive CONCURRENCY_LIMIT bump (cores × 4, cap 32) that microbenchmarked well but regressed the real walker-fed workload ~15%. The JSDoc now documents why the "I/O-bound therefore scale with something larger than cores" reasoning is misleading for this specific pipeline.

What's NOT in this PR

  • Init wizard migration (replace src/lib/init/tools/grep.ts + glob.ts subprocess shims with collectGrep/collectGlob). Separate follow-up PR with its own plan.
  • Worker pool. Rejected above.

Test plan

  • bun run typecheck — clean
  • bun run lint — clean (1 pre-existing warning in markdown.ts, not from this PR)
  • bun test --timeout 15000 test/lib test/commands test/types5464 pass, 0 fail (+183 new scan + bench + DSN tests including property tests via fast-check + 5 mixed-case regression tests)
  • bun test test/isolated — 138 pass, 0 fail
  • bun run bench --size largescanCodeForDsns within 1.2× of pre-PR baseline (312 ms vs 298 ms = 1.05×)
  • Semantic parity verified: DSN detection on the medium fixture produces 4 unique DSNs, 76 sourceMtimes, 461 dirMtimes — identical to pre-change scanner
  • Mixed-case scheme casings detected: Https://, hTtP://, HttpS:// — all pass

🤖 Generated with Claude Code

Co-authored-by: Claude Opus 4.7 (1M context) noreply@anthropic.com

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 20, 2026

Semver Impact of This PR

🟡 Minor (new features)

📋 Changelog Preview

This is how your changes will appear in the changelog.
Entries from this PR are highlighted with a left border (blockquote style).


New Features ✨

  • (scan) Pure-TS ripgrep-compatible scanner + DSN migration + perf overhaul by BYK in #791

Bug Fixes 🐛

Internal Changes 🔧

  • (init) Trim deprecated --features help entries by MathurAditya724 in #781
  • Regenerate docs by github-actions[bot] in 58a84035

🤖 This preview updates automatically when you update the PR.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 20, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://cli.sentry.dev/_preview/pr-791/

Built to branch gh-pages at 2026-04-21 10:59 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 20, 2026

Codecov Results 📊

138 passed | Total: 138 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch

Metric Change
Total Tests
Passed Tests
Failed Tests
Skipped Tests

✨ No test changes detected

All tests are passing successfully.

✅ Patch coverage is 98.05%. Project has 1730 uncovered lines.
✅ Project coverage is 95.65%. Comparing base (base) to head (head).

Files with missing lines (6)
File Patch % Lines
src/lib/scan/walker.ts 95.42% ⚠️ 13 Missing
src/lib/scan/grep.ts 96.62% ⚠️ 8 Missing
src/lib/dsn/code-scanner.ts 94.39% ⚠️ 6 Missing
src/lib/scan/regex.ts 98.39% ⚠️ 2 Missing
src/lib/scan/concurrent.ts 99.12% ⚠️ 1 Missing
src/lib/scan/glob.ts 98.21% ⚠️ 1 Missing
Coverage diff
@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
+ Coverage    95.58%    95.65%    +0.07%
==========================================
  Files          264       278       +14
  Lines        38454     39794     +1340
  Branches         0         0         —
==========================================
+ Hits         36755     38064     +1309
- Misses        1699      1730       +31
- Partials         0         0         —

Generated by Codecov Action

Comment thread src/lib/dsn/code-scanner.ts
Comment thread src/lib/scan/grep.ts Outdated
Comment thread src/lib/scan/grep.ts Outdated
Comment thread src/lib/scan/walker.ts
BYK added a commit that referenced this pull request Apr 21, 2026
…dedup

Three findings from Cursor Bugbot + Seer on PR #791:

**Seer (MEDIUM): followSymlinks: true silently drops symlinks**

`readdir({withFileTypes: true})` uses lstat semantics — on a symlink,
`isSymbolicLink()` returns true while `isFile()` and `isDirectory()`
both return false. The existing `processEntry` only routed entries
via `isDirectory()` / `isFile()`, so a symlink with `followSymlinks:
true` passed the skip check but then fell through and returned null.
The `inodeKey`/`visitedInodes` cycle detection in `maybeDescend` was
unreachable dead code.

Fix: when a symlink is encountered with `followSymlinks: true`,
`stat()` (not `lstat`) the target to learn its real type, route to
`maybeDescend` or `tryYieldFile` accordingly. Broken symlinks and
stat errors are silently skipped (matches ripgrep's behavior).
Circular symlinks are broken by the existing inode-based cycle
detection.

Tests: 4 new symlink cases in `walker.test.ts` covering default-skip,
follow-through, circular cycle detection, and broken symlinks.

**Cursor #2 (MEDIUM): concurrent grep workers share stateful RegExp**

When a pre-compiled `/gm` RegExp was passed to `grepFiles`,
`ensureGlobalMultilineFlags` returned it unchanged. Concurrent
`readAndGrep` workers then shared the regex's `lastIndex` property.
Strictly speaking this was safe today — JS is single-threaded and
the match loop has no `await`, so workers can't actually interleave
mid-iteration — but any future `await` inside the loop would turn it
into a silent match-dropping bug.

Fix: always clone the regex at the start of `readAndGrep` via `new
RegExp(src, flags)`. ~1µs per file, eliminates the foot-gun. Added a
regression test that exercises the shared-regex path with multiple
concurrent workers.

**Cursor #1 (LOW): shared utilities duplicated between grep and glob**

`compileMatchers`, `matchesAny`, `basenameOf`, `walkerRoot`, and
`joinPosix` were duplicated between `src/lib/scan/grep.ts` and
`glob.ts`. Extracted to new internal module `src/lib/scan/path-utils.ts`
(not part of the public barrel — implementation detail). Net LOC
reduction: ~80 lines across the two files.

## Test plan

- [x] `bun run typecheck` — clean
- [x] `bun run lint` — clean (1 pre-existing warning)
- [x] `bun test test/lib test/commands test/types` — **5443 pass, 0 fail** (+5 from this commit)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large` — no regression on `scanCodeForDsns` (312ms) or `scan.grepFiles` (278ms)
Comment thread src/lib/dsn/code-scanner.ts Outdated
Comment thread src/lib/scan/grep.ts
Comment thread src/lib/scan/concurrent.ts
Comment thread src/lib/scan/grep.ts
Comment thread src/lib/dsn/code-scanner.ts Outdated
Comment thread src/lib/scan/grep.ts
BYK added 5 commits April 21, 2026 10:32
… overhaul

Introduces `src/lib/scan/` — a pure-TypeScript, dependency-free-at-runtime
ripgrep-compatible file scanner (walker + ignore + binary detection + grep +
glob) — and migrates the DSN scanner at `src/lib/dsn/code-scanner.ts` to
consume it. Also ships a local benchmark harness, a data-driven concurrency
tuning pass, and two rounds of `IgnoreStack` optimization.

- **`src/lib/scan/`** (~2.7k LOC across 10 source files):
  walker (DFS with time budget + monorepo descent hook + gitignore), grep
  (`AsyncIterable<GrepMatch>` + `collectGrep`), glob (picomatch-backed),
  shared `mapFilesConcurrent` helper, regex helpers
  (`compilePattern`, `ensureGlobalFlag`, `ensureGlobalMultilineFlags`).
  Policy-free core — the DSN scanner composes its own preset via
  `src/lib/dsn/scan-options.ts`.

- **DSN scanner migration**: `src/lib/dsn/code-scanner.ts` shrinks from
  **715 → 435 LOC** (−280). All 5 public exports + wire shapes preserved
  byte-identical. Telemetry span/attributes unchanged. New behavior: respects
  nested `.gitignore` files (matches git's cumulative semantics — DSNs in
  files covered by a subdirectory `.gitignore` are no longer falsely
  detected).

- **Benchmark harness**: `script/bench.ts` + `script/bench-sweep.ts`,
  deterministic xorshift32-seeded fixture generator at
  `test/fixtures/bench/`. Three presets (small/medium/large ≈ 100/1k/10k
  files). `bun run bench`, `bench:save`, `bench:compare`, `bench:sweep`.
  Baselines gitignored (machine-specific).

- **Concurrency tuning**: data-driven sweep found the knee at
  `availableParallelism()` on a 4-core box (old default `50` was 8% slower
  at the fixture size). `CONCURRENCY_LIMIT` now `Math.max(2, cores)`.

- **IgnoreStack rewrite**: split into `#rootIg` + `#nestedByRelDir` with a
  root-only fast path. Per-query cost dropped from 3.76µs → 0.27µs (14×)
  on the common case, 5.76µs → 1.93µs (3×) on nested.

- **Grep whole-buffer matchAll**: replaced `content.split("\n")` + per-line
  `regex.test` with whole-buffer iteration via `regex.exec` + lazy line-number
  counting. 12× faster on profiled grep CPU phase.

- **Literal-prefix fast path** in DSN scanner: short-circuits `matchAll` when
  content has neither `"http"` nor `"HTTP"`. ~3% improvement on large fixture.

| Op                    | Baseline | Now    | Δ                      |
|-----------------------|---------:|-------:|------------------------|
| `detectDsn.cold`      |    179ms |  3.9ms | **46× faster**         |
| `scanCodeForFirstDsn` |    183ms |  2.1ms | **88× faster**         |
| `scanCodeForDsns`     |    298ms |  300ms | parity                 |
| `detectAllDsns.cold`  |    304ms |  310ms | parity                 |
| `scan.ignore` (µbench)| (new op) |  2.0ms | 81% faster than pre-opt|

The common-case hot paths (`detectDsn.cold`, `scanCodeForFirstDsn`) are
dramatically faster thanks to `mapFilesConcurrent`'s cooperative early-exit
flag. Find-all paths hold at parity while gaining nested-gitignore correctness,
richer cache coverage (`dirMtimes`), and a 40% LOC reduction.

- **Worker threads**: profiling showed CPU work (45ms DSN regex) is smaller
  than worker bootstrap + postMessage serialization. Net negative.
- **Replace `ignore` package with picomatch**: gitignore syntax isn't
  compatible with glob syntax. Would need a 200-400 LOC translator with
  subtle correctness edge cases.

- [x] `bun run typecheck` — clean
- [x] `bun run lint` — clean (1 pre-existing warning in markdown.ts)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5433 pass, 0 fail** (+152 new scan + bench + dsn tests)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large` — all success criteria met, benchmarks stable
Addresses two issues raised in PR 791 code review:

**C1: Mixed-case DSN schemes silently dropped** (regression in PR 791)

The literal-prefix fast path in `extractDsnsFromContent` used two
case-sensitive `indexOf` calls (`"http"` and `"HTTP"`), which covered
only all-lowercase and all-uppercase scheme prefixes. Mixed-case URLs
like `Https://...` or `hTtP://...` returned `[]` even though the
`DSN_PATTERN` regex (with `/i` flag) would have matched.

Impact: any source file with non-standard scheme casing (rare but
possible in generated code or test fixtures) would silently fail
DSN auto-detection — `sentry issue list` would behave as if no DSN
existed. Also hits the cache-verifier path via
`extractFirstDsnFromContent`.

Fix: replaced the two `indexOf` calls with `/http/i.test(content)`.
Adds ~16ms on a 10k-file scan (V8 regex vs raw `indexOf`), a trade
we accept for correctness.

**C2: Property test couldn't catch the mixed-case regression**

The property test for the fast path used a lowercase-only arbitrary
charset, so it could never generate the input class where the old
fast path diverged from the slow path. Widened `BENIGN_CHARS` to
include uppercase letters; tightened the JSDoc explaining the
invariant.

Also added an explicit `test.each` regression suite in
`code-scanner.test.ts` pinning detection across five scheme casings
(all-lower, all-upper, title, mixed, http no-s).

**Concurrency retune: back to `availableParallelism`**

Previously set to `Math.max(2, availableParallelism())` in PR 3.5.
First attempted to raise to `cores × 4` (up to 32) based on a
microbenchmark showing that pure-I/O per-file work has a knee at
~16-32 in-flight tasks. Measurement was correct but didn't reflect
the real workload: every current scan-module caller is walker-fed,
and the walker's serial `readdir` descent is the bottleneck. Raising
concurrency regressed `scanCodeForDsns` ~15% because more in-flight
tasks created scheduling pressure without additional useful work.

Final choice: `Math.min(16, Math.max(2, availableParallelism()))` —
same as before this commit (4 on 4-core, 2 floor for tiny CI, 16 cap
for large workstations). The JSDoc now honestly documents the
tension between walker-fed and pure-I/O workloads and tells callers
they can override per-call via `opts.concurrency` when they know the
shape of their work.

- [x] `bun run typecheck` — clean
- [x] `bun run lint` — clean (1 pre-existing warning)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5438 pass, 0 fail** (+5 new parametrized regression tests)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large` — `scanCodeForDsns` 312ms (PR 0 baseline 298ms, +4.7%, under 1.2× bar); hot-path ops `detectDsn.cold` 3.76ms / `scanCodeForFirstDsn` 2.10ms unchanged (46-87× faster than baseline)
- [x] Manual: mixed-case DSN casings (`Https://`, `hTtP://`, `HttpS://`) all detected post-fix
…dedup

Three findings from Cursor Bugbot + Seer on PR #791:

**Seer (MEDIUM): followSymlinks: true silently drops symlinks**

`readdir({withFileTypes: true})` uses lstat semantics — on a symlink,
`isSymbolicLink()` returns true while `isFile()` and `isDirectory()`
both return false. The existing `processEntry` only routed entries
via `isDirectory()` / `isFile()`, so a symlink with `followSymlinks:
true` passed the skip check but then fell through and returned null.
The `inodeKey`/`visitedInodes` cycle detection in `maybeDescend` was
unreachable dead code.

Fix: when a symlink is encountered with `followSymlinks: true`,
`stat()` (not `lstat`) the target to learn its real type, route to
`maybeDescend` or `tryYieldFile` accordingly. Broken symlinks and
stat errors are silently skipped (matches ripgrep's behavior).
Circular symlinks are broken by the existing inode-based cycle
detection.

Tests: 4 new symlink cases in `walker.test.ts` covering default-skip,
follow-through, circular cycle detection, and broken symlinks.

**Cursor #2 (MEDIUM): concurrent grep workers share stateful RegExp**

When a pre-compiled `/gm` RegExp was passed to `grepFiles`,
`ensureGlobalMultilineFlags` returned it unchanged. Concurrent
`readAndGrep` workers then shared the regex's `lastIndex` property.
Strictly speaking this was safe today — JS is single-threaded and
the match loop has no `await`, so workers can't actually interleave
mid-iteration — but any future `await` inside the loop would turn it
into a silent match-dropping bug.

Fix: always clone the regex at the start of `readAndGrep` via `new
RegExp(src, flags)`. ~1µs per file, eliminates the foot-gun. Added a
regression test that exercises the shared-regex path with multiple
concurrent workers.

**Cursor #1 (LOW): shared utilities duplicated between grep and glob**

`compileMatchers`, `matchesAny`, `basenameOf`, `walkerRoot`, and
`joinPosix` were duplicated between `src/lib/scan/grep.ts` and
`glob.ts`. Extracted to new internal module `src/lib/scan/path-utils.ts`
(not part of the public barrel — implementation detail). Net LOC
reduction: ~80 lines across the two files.

- [x] `bun run typecheck` — clean
- [x] `bun run lint` — clean (1 pre-existing warning)
- [x] `bun test test/lib test/commands test/types` — **5443 pass, 0 fail** (+5 from this commit)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large` — no regression on `scanCodeForDsns` (312ms) or `scan.grepFiles` (278ms)
… truncation overshoot

Two follow-up findings from Cursor Bugbot on commit dd4fde0:

**Cursor (MEDIUM): error path leaks partial dirMtimes**

The `scanCodeForDsns` catch block was returning
`{ dsns: [], sourceMtimes: {}, dirMtimes }` where the last field
carried partial state populated by `onDirectoryVisit` callbacks that
fired before the walker threw. If a downstream caller cached this
partial result, the cache verifier would later only check the dirs
we happened to reach pre-error and silently bless the cache for dirs
the walker never visited — missing DSNs in those unvisited dirs.

Fix: return all three maps empty on error. Matches the pre-PR-3
scanner's behavior and forces a full rescan on the next attempt.

**Cursor (LOW): collectGrep truncation off-by-one**

`collectGrep` forwarded `maxResults` directly to `grepFilesInternal`,
which flipped `stats.truncated = true` the moment `matchesEmitted >=
maxResults`. So asking for `maxResults: 3` against a corpus with
exactly 3 matches (no more) falsely reported truncation.

Fix: mirror `collectGlob`'s `+1` overshoot pattern. Ask the iterator
for `maxResults + 1` internally; stop draining at `matches.length >=
maxResults`; only set `truncated = true` if we actually saw the
overshoot match. Also matches our own documented lore about this
pattern (which I violated — the reviewer's catch).

## Test plan

- [x] `bun run typecheck` — clean
- [x] `bun run lint` — clean (1 pre-existing warning)
- [x] `bun test test/lib test/commands test/types` — **5445 pass, 0 fail** (+2 regression tests)
- [x] `bun test test/isolated` — 138 pass
- [x] Neither fix affects the happy-path runtime (error branch never runs; truncation probe adds one iteration only when the corpus has strictly more than `maxResults` matches)
…onors opt-out

Two more findings from Cursor Bugbot on commit 5347702:

**Cursor (MEDIUM): producer errors swallowed on consumer break**

`mapFilesConcurrentStream` placed the `throw producerError` after its
`try/finally` block. When a consumer `break`s early, the runtime
calls the generator's `return()`, which runs `finally` but does NOT
execute any statement after the try block. Producer errors were
therefore silently lost on the break path (though visible on
drain-to-completion).

Verified with a standalone repro:
```js
async function* gen() {
  try { yield 1; yield 2; } finally { console.log("finally ran"); }
  throw new Error("dead code");
}
for await (const x of gen()) { if (x === 1) break; }
// → "finally ran" prints; error does NOT surface.
```

Fix: move the rethrow inside the `finally` block. Added a regression
test that reverts cleanly — when the fix is removed, the test fails
with `caught === null`.

**Cursor (LOW): `multiline: false` silently overridden**

`readAndGrep` unconditionally applied `/m` via
`ensureGlobalMultilineFlags`, making the `GrepOptions.multiline`
option a no-op. Caller passing `multiline: false` expecting
buffer-boundary anchoring got line-boundary anchoring instead.

Root cause: the whole-buffer `matchAll` rewrite in PR 3.5 needed `/m`
to preserve grep's line-boundary semantics (`^foo` matches any line
starting with `foo`), but I hardcoded it rather than making the
default explicit.

Fix: the `multiline` option now defaults to `true` (grep-like) and
honors `multiline: false` when explicitly set (buffer-boundary JS
semantics). Updated JSDoc in `types.ts` to document the new contract.

Added two regression tests:
- default: `^foo` matches mid-file lines (grep-like)
- `multiline: false`: `^foo` only matches at buffer start

- [x] `bun run typecheck` — clean
- [x] `bun run lint` — clean (1 pre-existing warning)
- [x] `bun test test/lib test/commands test/types` — **5448 pass, 0 fail** (+3 from this commit)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large` — no perf regression (per-op: scan.grepFiles 313ms, scan.walk.dsnParity 212ms, scan.ignore 1.74ms, detectDsn.cold 5.85ms, scanCodeForFirstDsn 4.24ms, scanCodeForDsns 365ms)
@BYK BYK force-pushed the byk/feat/ripgrep branch from 8e98ddd to a09d765 Compare April 21, 2026 10:37
…sult

Follow-up to Cursor Bugbot finding on the rebased commit: `processEntry`
returned `[]` for files with no DSNs, which `mapFilesConcurrent` pushes
into its results array and fires `onResult` for (as a no-op over an
empty array). On a 10k-file walk where ~99% of files contain no DSN,
this accumulates ~9900 wasted `onResult` invocations and pushes ~9900
empty arrays into the otherwise-unused results buffer.

Fix: return `null` from `processEntry` when the file contributes
nothing (no raw matches, all matches rejected by host validation, or
the file was unreadable). `mapFilesConcurrent` already filters `null`
results before invoking `onResult` and before pushing to results, so
the empty-work path is fully skipped.

This also tidies a parallel subtlety: when `detected.length === 0`
(raw matches existed but host validation rejected all of them), we
now skip sourceMtimes recording AND the onResult call, consistent with
"no DSNs here". Previously we'd push an empty array and record no
mtime but still fire onResult — inconsistent signal.

Type change: `processEntry(...)` now returns `Promise<DetectedDsn[] | null>`.
Private function, no external API impact.

## Test plan

- [x] `bun run typecheck` — clean
- [x] `bun run lint` — clean (1 pre-existing warning)
- [x] `bun test test/lib/dsn/` — **150 pass, 0 fail** (DSN semantics unchanged: 4 DSNs, 76 sourceMtimes, 461 dirMtimes on medium fixture)
- [x] `bun test test/lib/scan/` — 153 pass (scan module untouched)
Comment thread src/lib/scan/grep.ts Outdated
Comment thread src/lib/scan/grep.ts
Two low-severity findings from Cursor Bugbot on commit 5bb480c:

**`filesRead` counter incremented before read succeeded**

`grepFilesInternal`'s per-file callback bumped `opts.stats.filesRead`
*before* `readAndGrep` executed. `GrepStats` documents the field as
"files whose content was read and tested against the pattern" but
the actual count included failed opens (`readAndGrep` returns `null`
on read error). Fixed: increment only after a non-null return.

**`grepFiles` and `collectGrep` duplicated pipeline setup**

Both entry points had ~15 lines of identical code for pattern
compilation, matcher compilation, default resolution, walker
construction, and filter wiring. Any future change to defaults
needed to be applied to both; a divergence between them would be a
silent correctness bug. Extracted to a shared `setupGrepPipeline()`
helper that returns a `GrepPipelineSetup` bundle consumed by both
entry points.

`collectGrep` still manages its own `+1 probeLimit` override and the
collector-side truncation flag after the shared setup runs.

## Test plan

- [x] `bun run typecheck` — clean
- [x] `bun run lint` — clean (1 pre-existing warning)
- [x] `bun test test/lib test/commands test/types` — **5464 pass, 0 fail**
- [x] `bun test test/lib/scan/` — 153 pass (same)
Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 7bddf97. Configure here.

Comment thread src/lib/scan/ignore.ts
let ignored = rootResult.ignored && !rootResult.unignored;
if (rootResult.unignored) {
ignored = false;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant conditional in isIgnored Tier 2 path

Low Severity

In the Tier 2 path of isIgnored, lines 216-218 (if (rootResult.unignored) { ignored = false; }) are dead code. Line 215 already computes ignored = rootResult.ignored && !rootResult.unignored, which correctly handles every combination: when unignored is true, the && with !unignored already yields false. The subsequent if block can never change the value of ignored.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 7bddf97. Configure here.

@BYK BYK merged commit 78fe7a1 into main Apr 21, 2026
28 checks passed
@BYK BYK deleted the byk/feat/ripgrep branch April 21, 2026 11:22
BYK added a commit that referenced this pull request Apr 21, 2026
## Summary

Two regex-level optimizations to narrow the perf gap with ripgrep on our
pure-TS `collectGrep`/`grepFiles`. Follow-up to PR #791 and #797.

- **Literal prefilter** — ripgrep-style: extract a literal substring
from the regex source (e.g., `import` from `import.*from`), scan the
buffer with `indexOf` to locate candidate lines, only invoke the regex
engine on lines that contain the literal. V8's `indexOf` is roughly
SIMD-speed; skipping the regex engine on non-candidate lines is where
most of the win comes from.
- **Lazy line counting** — swapped `charCodeAt`-walk for `indexOf("\n",
cursor)` hops. 2-5× faster on the line-counting sub-loop because V8
implements `indexOf` in C++ without per-iteration JS interop.

## Perf impact (synthetic/large, 10k files, Bun 1.3.11, 4-core)

| Op | Before | After | Δ |
|---|---:|---:|---:|
| `scan.grepFiles` (DSN pattern) | 370 ms | **318 ms** | **−14%** |
| `detectAllDsns.cold` | 363 ms | **313 ms** | **−14%** |
| `detectDsn.cold` | 7.73 ms | **5.61 ms** | **−27%** |
| `scanCodeForFirstDsn` | 2.91 ms | **2.13 ms** | **−27%** |
| `scanCodeForDsns` | 342 ms | 333 ms | −3% (noise-equivalent) |
| `import.*from` uncapped (bench) | 1489 ms | **1178 ms** | **−21%** |

The DSN workloads improve because `DSN_PATTERN` extracts `http` as its
literal — most source files don't contain `http` at all, so the
prefilter short-circuits before the regex runs.

No regressions on any benchmark. Pure-literal patterns (e.g.,
`SENTRY_DSN`, `NONEXISTENT_TOKEN_XYZ`) continue through the whole-buffer
path unchanged.

## What changed

### New file: `src/lib/scan/literal-extract.ts` (~300 LOC)

Conservative literal extractor. Walks a regex source looking for the
longest contiguous run of literal bytes that every match must contain.
Bails out safely on top-level alternation, character classes, groups,
lookarounds, quantifiers, and escape classes.

Handles escaped metacharacters intelligently: `Sentry\.init` yields
`Sentry.init` (extracted via literal `\.` → `.`), while `\bfoo\b` yields
`foo` (escape `\b` is an anchor, not a literal `b`).

Exports:
- `extractInnerLiteral(source, flags)` — returns the literal, or null if
no safe extraction possible. Honors `/i` by lowercasing.
- `isPureLiteral(source, flags)` — true when the pattern IS a bare
literal with no metacharacters. Used by the grep pipeline to route
pure-literals to the whole-buffer path (V8's regex engine is
hyper-optimized for pure-literal patterns; the prefilter adds overhead
without benefit there).

### Modified: `src/lib/scan/grep.ts` (~240 LOC changes)

Three-way dispatch in `readAndGrep` based on the extracted literal:

1. **`grepByLiteralPrefilter`** (new) — regex with extractable literal +
`multiline: true`. Uses `indexOf(literal)` to find candidate lines, runs
the regex engine only on those. This is the main perf win.
2. **`grepByWholeBuffer`** — existing path, used for:
   - Pure-literal patterns (V8 handles them optimally)
- Patterns with no extractable literal (complex regex, top-level
alternation)
   - `multiline: false` mode (the fast path requires per-line semantics)

Also: replaced the `charCodeAt`-walk that counted newlines char-by-char
with an `indexOf("\n", cursor)` hop loop. Extracted `buildMatch(ctx,
bounds)` as a shared helper to bundle the match-construction arguments.

### Tests added

- `test/lib/scan/literal-extract.test.ts` — **39 tests** covering the
extractor's rules (escape handling, quantifier drop, alternation bail,
case-insensitive, minimum length).
- `test/lib/scan/grep.test.ts` — **7 new tests** for the prefilter fast
path: correctness vs whole-buffer, escaped-literal extraction,
case-insensitive flag, zero-literal-hit short-circuit, routing of pure
literals to whole-buffer, and alternation routing.

## Why this approach

From the ripgrep research (attached to PR #791): rg's central perf trick
is extracting a literal from each regex and prefiltering with SIMD
memchr. V8 doesn't expose SIMD directly but its
`String.prototype.indexOf` is compiled to a tight byte-level loop with
internal SIMD on x64 — functionally equivalent for our use case.

Three of the five techniques in the Loggly regex-perf guide were
evaluated:
- **Character classes over `.*`** — `DSN_PATTERN` already uses
`[a-z0-9]+`, no change needed.
- **Alternation order** — `DSN_PATTERN`'s `(?:\.[a-z]+|:[0-9]+)` is
already correctly ordered (`.` more common than `:` in DSN hosts);
swapping regressed perf by noise.
- **Anchors/word boundaries** — adding `\b` to `DSN_PATTERN` *regressed*
perf 2.8× on our workload. V8's existing fast character-mismatch
rejection on the first byte outperforms the boundary check overhead.

The remaining gap with rg is now primarily orchestration overhead
(async/await, `mapFilesConcurrent`, walker correctness features) rather
than regex speed. A worker-pool exploration may follow.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning in
`src/lib/formatters/markdown.ts` unrelated to this PR)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
**5610 pass, 0 fail** (+58 new)
- [x] `bun test test/isolated` — 138 pass, 0 fail
- [x] `bun run bench --size large --runs 5` — all scan ops at or below
previous baseline
- [x] Manually verified semantic parity: `collectGrep` returns identical
`GrepMatch[]` on prefilter vs whole-buffer paths for patterns where the
prefilter fires

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BYK added a commit that referenced this pull request Apr 21, 2026
…#805)

## Summary

Walker was 411ms p50 for 10k files on the synthetic/large fixture —
roughly 10× slower than a naive sync walk (50ms). Follow-up to PR #791
and #804.

Five targeted optimizations in the hot loop, all micro-benchmarked
before landing:

| Change | Measured win | End-to-end win |
|---|---|---|
| `readdirSync` instead of `readdir` | 105→45ms (−60ms) | dominant |
| `statSync` instead of `Bun.file().size` | −10ms per 10k files | |
| String concat for paths | path.join 7ms → concat 0.7ms (10×) | |
| `abs.slice(cwdPrefixLen)` for relPath | path.relative 9ms → slice
0.8ms (11×) | |
| Manual `lastIndexOf`-based extname | 9ms → 7ms (25%) | |

### Perf results (synthetic/large, 10k files, p50)

| Op | Before | After | Δ |
|---|---:|---:|---:|
| `scan.walk` | 411ms | **231ms** | **−44%** |
| `scan.walk.noExt` | 572ms | 448ms | **−22%** |
| `scan.walk.dsnParity` | 228ms | **138ms** | **−39%** |
| `scanCodeForDsns` | 323ms | 304ms | −6% |
| `detectAllDsns.cold` | 327ms | 308ms | −6% |
| `detectAllDsns.warm` | 27.9ms | 27.0ms | — |
| `scan.grepFiles` | 322ms | 316ms | noise |
| `scanCodeForFirstDsn` | 2.23ms | 2.05ms | — |

Walker ops are 22-44% faster. Downstream ops (grep, DSN scanner) benefit
less because their time is dominated by content scanning, not walking —
but still show consistent ~6% improvements.

## Why `readdirSync` is safe here

The sync vs async tradeoff usually favors async because blocking the
event loop is bad in general. But measured per-call cost matters:

| Metric | `readdirSync` on 3635 dirs |
|---|---|
| p50 | 11µs |
| p95 | 24µs |
| p99 | 35µs |
| max | 65µs |

65µs max block is trivial — `setTimeout(0)` latency in Node is ~4ms.
Blocking for 65µs never causes noticeable event-loop pauses. And we pay
~60µs of microtask overhead for each async readdir, which wipes out any
theoretical fairness benefit. Net: 2-3× faster per-dir on walks with
many small directories, which is every realistic CLI workload.

If this ever matters for a weird embedded use case, the optimization is
trivially reversible — the walker's public API is unchanged.

## mtime parity fix

Initial implementation regressed `detectAllDsns.warm` from 28ms → 304ms
because `statSync().mtimeMs` is a float (e.g. `1776790602458.1033`)
while `Bun.file().lastModified` is already an integer. The DSN cache
validator compares floored `sourceMtimes`, so un-floored floats caused
cache misses on every warm call. Fixed by flooring explicitly in
`tryYieldFile` — matches the same treatment already applied to
`onDirectoryVisit`'s dirMtimes.

## Walker v2 design notes

- **Sync readdir + async yield.** Readdir blocks briefly (~65µs max),
but the surrounding async-generator interface is preserved. Library
consumers that insert async work between iterations still get
cooperative scheduling.
- **Path ops inlined.** `path.join` / `path.relative` / `path.extname` +
`toLowerCase` are all replaced with manual string ops in the hot loop.
On Windows, `normalizePath` is still applied via `replaceAll(NATIVE_SEP,
"/")` — the POSIX fast path uses a cached `POSIX_NATIVE` module
constant.
- **`WalkContext.cwdPrefixLen`** precomputed once per walk, used to
slice relative paths from absolute paths.
- **Cached module constants** `NATIVE_SEP` and `POSIX_NATIVE` avoid
per-call `path.sep` property lookups in the hot loop.

## What this PR does NOT change

- `WalkEntry` shape: preserved identically (absolutePath, relativePath,
size, mtime, isBinary, depth).
- `WalkOptions` contract: no new options, no semantics changes.
- Symlink handling, `followSymlinks`, cycle detection via
`visitedInodes`: unchanged.
- IgnoreStack + gitignore semantics: unchanged.
- `onDirectoryVisit`, `recordMtimes`, `abortSignal`: unchanged.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning in
`src/lib/formatters/markdown.ts` unrelated)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
**5640 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun test test/lib/scan/walker.test.ts` — 34 pass (incl. property
tests covering hidden files, symlinks, maxDepth, gitignore interaction)
- [x] `bun test test/lib/dsn/code-scanner.test.ts` — 52 pass (incl.
dirMtimes / sourceMtimes cache validation)
- [x] DSN count correctness verified end-to-end: 4 DSNs found on fixture
(matches pre-change count)
- [x] `bun run bench --size large --runs 5 --warmup 2` — results in
table above

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant